Credit Card Fraud Detection¶

In this project we will look at the credit card fraud data set provided by datacamp.com. First, we will clean the data by checking for missing values and duplicates and handling any we find. Then we will visualize the data to see if we can spot any patterns. Finally, we will use the data to create a model that predicts instances of credit card fraud.

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')

ccf = pd.read_csv('credit_card_fraud.csv')
ccf.head()
Out[2]:
trans_date_trans_time merchant category amt city state lat long city_pop job dob trans_num merch_lat merch_long is_fraud
0 2019-01-01 00:00:44 Heller, Gutmann and Zieme grocery_pos 107.23 Orient WA 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21 1f76529f8574734946361c461b024d99 49.159047 -118.186462 0
1 2019-01-01 00:00:51 Lind-Buckridge entertainment 220.11 Malad City ID 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19 a1a22d70485983eac12b5b88dad1cf95 43.150704 -112.154481 0
2 2019-01-01 00:07:27 Kiehn Inc grocery_pos 96.29 Grenada CA 41.6125 -122.5258 589 Systems analyst 1945-12-21 413636e759663f264aae1819a4d4f231 41.657520 -122.230347 0
3 2019-01-01 00:09:03 Beier-Hyatt shopping_pos 7.77 High Rolls Mountain Park NM 32.9396 -105.8189 899 Naval architect 1967-08-30 8a6293af5ed278dea14448ded2685fea 32.863258 -106.520205 0
4 2019-01-01 00:21:32 Bruen-Yost misc_pos 6.85 Freedom WY 43.0172 -111.0292 471 Education officer, museum 1967-08-02 f3c43d336e92a44fc2fb67058d5949e3 43.753735 -111.454923 0

Clean Data¶

In this section we will check for missing values and duplicates and deal with them accordingly.

In [4]:
print(ccf.isna().sum())
print('Number of duplicates: ' + str(ccf.duplicated().sum()))
trans_date_trans_time    0
merchant                 0
category                 0
amt                      0
city                     0
state                    0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64
Number of duplicates: 0

Our data has no missing values and no duplicates.

Add age and hour columns¶

It is possible that credit card fraud correlates with the age of the card holder and with the hour of the day of the transaction, so we will add columns for these features.

In [7]:
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'])
ccf['dob'] = pd.to_datetime(ccf['dob'])
ccf['age_at_trans'] = ((ccf['trans_date_trans_time'] - ccf['dob']).dt.days / 365.25).astype(int)
ccf['hour_of_trans'] = ccf['trans_date_trans_time'].dt.hour

print(ccf['trans_date_trans_time'].dtype)
print(ccf['dob'].dtype)
print(ccf['age_at_trans'].head())
print(ccf['hour_of_trans'].unique())
datetime64[ns]
datetime64[ns]
0    40
1    56
2    73
3    51
4    51
Name: age_at_trans, dtype: int32
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]

The output shows the dates were converted to the correct type and the age and hour columns were successfully created.

Exploring the data¶

First we filter the data to contain only the fraudulent transactions.

In [11]:
frauds = ccf[ccf['is_fraud'] == 1]
frauds_by_cat = frauds['category'].value_counts()

Fraud by merchant category:¶

Fraud may occur at different rates between merchant types. We can plot the number of instances of fraud by merchant type.

In [13]:
sns.barplot(frauds_by_cat)
plt.title('Number of Frauds vs. Merchant Category')
plt.xticks(rotation=90)
plt.xlabel('Merchant Category')
plt.ylabel('Number of Credit Card Frauds')
plt.show()

The top five merchant categories for credit card fraud are in-store grocery shopping, online shopping, miscellaneous online purchases, in-person shopping, and gas/transportation.

Fraud by transaction amount¶

We can plot the number of instances of fraud that occur by transaction amount to understand how often fraud occurs for different transaction amounts.

In [16]:
sns.histplot(frauds['amt'], bins=[50 * i for i in range(30)])
plt.title('Number of Frauds vs. Transaction Amount')
plt.xlabel('Transaction Amount')
plt.ylabel('Number of Credit Card Frauds')
plt.show()
In [17]:
print('Number of frauds of $50 or less: ' + str(len(frauds[frauds['amt'] <= 50])))
print('Number of frauds between $200 and $400: ' + str(len(frauds[(frauds['amt'] >= 200) & (frauds['amt'] <= 400)])))
print('Number of frauds between $600 and $1200: ' + str(len(frauds[(frauds['amt'] >= 600) & (frauds['amt'] <= 1200)])))
Number of frauds of $50 or less: 390
Number of frauds between $200 and $400: 483
Number of frauds between $600 and $1200: 801

We can see that there are three distinct ranges at which fraud most commonly occurs. There were 390 cases between 0 and 50 dollars, 483 cases between 200 and 400 dollars, and 801 cases between 600 and 1200 dollars.
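Counts like these can also be tallied in one step with `pd.cut`, which bins amounts into labeled dollar ranges. A minimal sketch on hypothetical amounts (not the notebook's `frauds` data):

```python
import pandas as pd

# Hypothetical transaction amounts illustrating the three ranges noted above
amts = pd.Series([12.5, 37.0, 250.0, 310.0, 650.0, 900.0, 1100.0])

# Bin each amount into a labeled dollar range (right edge inclusive)
ranges = pd.cut(amts, bins=[0, 50, 200, 400, 600, 1200],
                labels=['0-50', '50-200', '200-400', '400-600', '600-1200'])
counts = ranges.value_counts().sort_index()
print(counts)
```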

Fraud by time of transaction:¶

It is possible that fraud is more likely to occur at different times of day. The plot below visualizes the number of frauds that occur by the hour of the day.

In [20]:
sns.countplot(frauds, x='hour_of_trans')
plt.title('Number of Frauds vs. Hour of Day')
plt.xlabel('Hour of day')
plt.ylabel('Number of Credit Card Frauds')
plt.show()

From the plot it is clear that the vast majority of fraud takes place between 10pm and 4am.
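One way to quantify this observation is to flag the late-night hours and take the mean of the flag, which gives the share of frauds in that window. Sketched below on hypothetical hours rather than the `frauds` column:

```python
import pandas as pd

# Hypothetical transaction hours; flag those in the 10pm-4am window noted above
hours = pd.Series([23, 1, 3, 14, 22, 2, 9, 0])
night = hours.isin([22, 23, 0, 1, 2, 3])

# The mean of the boolean flag is the share of transactions in the window
print(night.mean())
```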

Fraud by age group:¶

Fraud may occur at different rates in different age groups. We can plot the number of frauds that occur in different age groups to visualize the relationship.

In [23]:
sns.histplot(frauds['age_at_trans'], bins=[15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95])
plt.xticks([15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95])
plt.title('Number of Frauds vs. Age of Card Holder')
plt.xlabel('Age of Credit Card Holder')
plt.ylabel('Number of Credit Card Frauds')
plt.show()

print('Mean age: ' + str(round(frauds['age_at_trans'].mean(), 1)), 'Standard Deviation: ' + str(round(frauds['age_at_trans'].std(), 1)))
Mean age: 50.2 Standard Deviation: 18.3

We see that the age of victims of credit card frauds is approximately normally distributed with a mean of 50.2 years and standard deviation of 18.3 years.
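Under that normal approximation, roughly 95% of victims' ages should fall within two standard deviations of the mean, which we can compute from the figures above:

```python
# Mean and standard deviation of victim age reported above
mean_age, sd_age = 50.2, 18.3

# ~95% of a normal distribution lies within two standard deviations of the mean
low, high = mean_age - 2 * sd_age, mean_age + 2 * sd_age
print(round(low, 1), round(high, 1))
```

So under this approximation, about 95% of fraud victims are between roughly 14 and 87 years old.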

Fraud by geographic location¶

It is possible that fraud occurs at different rates in different locations. To visualize this relationship we can find the proportion of all transactions in each state that are fraudulent and then color code each state by its fraud rate.

In [26]:
import plotly.express as px

# Fraud rate per state: the mean of the binary is_fraud column, as a percentage.
# (Using the group mean avoids assuming every state has both fraudulent and
# legitimate rows, which positional indexing into value_counts would require.)
fraud_rates = ccf.groupby('state')['is_fraud'].mean() * 100

fig = px.choropleth(locationmode='USA-states', scope='usa', locations=fraud_rates.index, color=fraud_rates.values, title='Fraud Rate by US State')
fig.update_layout(coloraxis_colorbar_title_text='Fraud Rate (%)')
fig.show()

We see that in our data set the three states with the highest percentage of fraudulent transactions are Alaska, Oregon, and Nebraska. All of the states in our data have similar fraud rates, ranging from about 0.5% to 1.7%.

Classification Model¶

We can now use the features explored above to create a prediction model. Since credit card fraud is a binary outcome (either fraud or no fraud) we can use logistic regression. Logistic regression has several assumptions that need to be met. First, the observations need to be independent; this holds for our data since each row represents a different transaction and there are no duplicates. Second, there must be little to no multicollinearity between the features we choose. To check this we will verify that our features are not highly correlated.

First, we have to encode our categorical merchant category variable as numeric dummy variables. Then we will check for correlation among our features.

In [30]:
features_df = pd.get_dummies(ccf, columns=['category'])
corr = features_df.drop(columns=['trans_date_trans_time', 'merchant', 'city', 'state', 'job', 'dob', 'trans_num', 'merch_lat', 'merch_long', 'is_fraud']).corr()
sns.heatmap(corr, cmap="YlGnBu", cbar_kws={'label': 'Correlation Value'})
plt.title('Feature Correlation')
plt.show()

We see that the correlation between most of our features is low, but there is some correlation (max/min correlation of around 0.25 and -0.35) between a small subset of our features. Specifically, there is some correlation between the hour of the transaction and the category of the transaction, which is expected.
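Besides eyeballing the heatmap, multicollinearity is often quantified with variance inflation factors (VIFs); these equal the diagonal entries of the inverse of the feature correlation matrix, with values near 1 indicating little collinearity. A minimal sketch on hypothetical features (not the notebook's `features_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical features: 'mixed' is deliberately correlated with 'amt'
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = pd.DataFrame({'amt': a, 'hour': b,
                  'mixed': 0.6 * a + 0.4 * rng.normal(size=500)})

# VIF of each feature = corresponding diagonal entry of the inverse
# of the feature correlation matrix; VIF near 1 means little collinearity
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
print(vif.round(2))
```

Here the independent `hour` feature gets a VIF near 1, while the correlated `amt`/`mixed` pair get noticeably higher values.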

Logistic Regression¶

We can now split our data into training and testing sets and fit a logistic regression model to the training data. Since fraudulent transactions make up only a small proportion of all transactions, there is a class imbalance we will have to address. According to data from the Federal Reserve, credit card fraud is estimated to make up about 1% of all purchases (source: https://wallethub.com/edu/cc/credit-card-fraud-statistics/25725). Therefore, to account for the class imbalance in our data we will set class weights of 0.01 and 0.99 for the legitimate and fraudulent transaction classes respectively.
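A common alternative to hand-picked weights is the "balanced" heuristic (as in scikit-learn's `class_weight='balanced'`), which weights each class inversely to its frequency. A sketch on hypothetical labels with the roughly 1% fraud rate cited above:

```python
import numpy as np

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions
y = np.array([0] * 990 + [1] * 10)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
classes, counts = np.unique(y, return_counts=True)
weights = dict(zip(classes, len(y) / (len(classes) * counts)))
print(weights)
```

With a 99:1 class ratio, the fraud class ends up weighted 99 times more heavily than the legitimate class.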

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = features_df.drop(columns=['trans_date_trans_time', 'merchant', 'city', 'state', 'job', 'dob', 'trans_num', 'merch_lat', 'merch_long', 'is_fraud'])
y = features_df['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

standard_scaler = StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_test_standard = standard_scaler.transform(X_test)

logreg = LogisticRegression(random_state=33, class_weight={0: 0.01, 1: 0.99})
logreg.fit(X_train_standard, y_train)

y_pred = logreg.predict(X_test_standard)
y_pred_proba = logreg.predict_proba(X_test_standard)[:, 1]

Model Evaluation¶

To evaluate how effective our model is at predicting fraud we will look at the confusion matrix to visualize how many true/false negatives/positives our model predicts, and we will look at the ROC curve, specifically the AUC metric, to understand how our model performs.

In [35]:
from sklearn.metrics import confusion_matrix

conf_mtrx = confusion_matrix(y_test, y_pred)

axis_labels = ['0: Legitimate', '1: Fraud']

sns.heatmap(pd.DataFrame(conf_mtrx), annot=True, cmap='YlGnBu', fmt='d', xticklabels=axis_labels, yticklabels=axis_labels)
plt.title('Actual vs. Predicted Transaction Legitimacy')
plt.xlabel('Predicted Legitimacy')
plt.ylabel('Actual Legitimacy')
plt.show()

Our model correctly predicts 323 actual fraudulent transactions with 119 false negatives, 2450 false positives, and 82010 true negatives.
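The counts above translate directly into precision and recall, which are often more informative than raw accuracy for imbalanced classes:

```python
# Confusion-matrix counts reported above
tp, fn, fp, tn = 323, 119, 2450, 82010

precision = tp / (tp + fp)  # share of flagged transactions that were fraud
recall = tp / (tp + fn)     # share of actual frauds that were flagged

print(round(precision, 3), round(recall, 3))
```

So the model catches roughly 73% of frauds, at the cost of many false alarms (precision around 12%).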

In [37]:
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test,  y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)

print('AUC Score: ' + str(auc))

plt.plot(fpr,tpr,label="Weighted Logistic Regression, auc="+str(auc))
plt.title('ROC')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
AUC Score: 0.9048428772408798

Our model has an AUC score of about 0.9, which leads us to believe it is effective at discriminating between fraudulent and legitimate transactions.
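The AUC has a useful probabilistic reading: it is the probability that a randomly chosen fraudulent transaction receives a higher predicted score than a randomly chosen legitimate one. A minimal sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical predicted fraud probabilities for each class
fraud_scores = np.array([0.9, 0.8, 0.4])
legit_scores = np.array([0.2, 0.3, 0.5, 0.1])

# AUC = P(fraud score > legit score), counting ties as half a win
wins = [(f > l) + 0.5 * (f == l) for f in fraud_scores for l in legit_scores]
auc = float(np.mean(wins))
print(round(auc, 3))
```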

Conclusion¶

To sum up, in this project we looked at a dataset of credit card transactions and built a model that classifies each transaction as fraudulent or legitimate. We started by exploring and visualizing our data to see how fraud was related to the other features of the data set. We then used these features as variables in a weighted logistic regression. After fitting our model to the data we found that the model correctly predicted 323 cases of actual fraud, with 119 false negative predictions, 2450 false positive predictions, and 82010 true negative predictions. Finally, we looked at the ROC curve and found an AUC score of 0.904, which tells us that our model is effective at distinguishing between fraudulent and legitimate transactions.